The BlueGene/L Supercomputer*

Gyan Bhanot a , Dong Chen a , Alan Gara a and Pavlos Vranas a 

a IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA 

The architecture of the BlueGene/L massively parallel supercomputer is described. Each computing node 
consists of a single compute ASIC plus 256 MB of external memory. The compute ASIC integrates two 700 
MHz PowerPC 440 integer CPU cores, two 2.8 Gflops floating point units, 4 MB of embedded DRAM as cache, 
a memory controller for external memory, six 1.4 Gbit/s bi-directional ports for a 3-dimensional torus network 
connection, three 2.8 Gbit/s bi-directional ports for connecting to a global tree network and a Gigabit Ethernet 
for I/O. 65,536 of such nodes are connected into a 3-d torus with a geometry of 32x32x64. The total peak 
performance of the system is 360 Teraflops and the total amount of memory is 16 TeraBytes. 
1. INTRODUCTION

IBM has previously announced a multi-year initiative to build a
petaflops scale supercomputer for computational biology research [1].
The BlueGene/L machine is a first step in this program. It is based
on a different and more general architecture than the original
BlueGene announcement. In particular, BlueGene/L uses embedded
PowerPC processor cores developed by IBM Microelectronics [2] for
ASIC products.

The lattice community has seen great success with special-purpose,
massively parallel machines dedicated to QCD, for example, the
combined 1 Teraflops QCDSP machine [3] and the follow-on 20 Teraflops
QCDOC machine [4] currently being developed at Columbia University in
collaboration with IBM Research.

The BlueGene/L design philosophy has been influenced by QCDSP and
QCDOC. In contrast to the current commercial approach of building
large-scale supercomputers by clustering general purpose yet complex
SMP systems, BG/L leverages IBM's system-on-a-chip silicon technology
to build a large parallel system of more than 65,000 nodes, at a
significantly lower price/performance and power
consumption/performance than conventional approaches. Compared to
QCDOC, BG/L will use a newer generation of IBM's silicon technology,
with enhancements in single node performance as well as a more
general network supporting hardware point-to-point message passing
(cut-through routing). This makes BG/L suitable for a wide variety of
applications.

The BlueGene/L project is a jointly funded research partnership
between IBM and the Lawrence Livermore National Laboratory (LLNL) as
part of the US Department of Energy ASCI Advanced Architecture
Research Program. The main research and development effort is
centered at the IBM T.J. Watson Research Center, with support from
the IBM Enterprise Server Group and IBM Microelectronics. Application
performance and scaling studies have been initiated with partners at
a number of academic and government institutions, including the San
Diego Supercomputing Center and the California Institute of
Technology.

A large machine with a peak performance of 
360 Teraflops is anticipated to be built at the 
LLNL in the 2004-2005 time frame. A smaller, 
100 Teraflops machine is expected to be built at 
the IBM T.J. Watson Research Center for com- 
putational biology studies. 

2. BG/L OVERVIEW 

BlueGene/L is a massively parallel, scalable system. A single
parallel job can use up to 65,536 compute nodes. The system is
configured as a 32x32x64 three-dimensional torus of compute nodes.
Each node consists of a single compute ASIC and memory. Each node can
support up to 2 GB of local memory. Balancing cost and application
requirements, our current plan calls for 256 MB of DDR-SDRAM per
node.

Figure 1. Building BlueGene/L. A node ASIC contains 2 CPUs. 2 node ASICs along with their associated
local external memory are built onto a compute card. A node board contains 16 compute cards. 32 node
boards are then plugged into both sides of 2 mid-plane boards in a cabinet. A large BG/L system contains
64 cabinets, forming a 32x32x64 torus of compute nodes, with a peak performance of 180/360 Tflops.

The ASIC will be manufactured in IBM CMOS CU-11 0.13 micron copper
technology with an expected 11.1 mm² die size. This is the next
generation of IBM CMOS technology after the one used by QCDOC. Each
BG/L ASIC contains two PowerPC 440 32-bit integer CPU cores. Each
core connects to a "double" floating point unit capable of 2 fused
floating point multiply-adds per CPU clock cycle. At a target
frequency of 700 MHz, each core can achieve a peak performance of
2.8 Gflops. In normal operation, we expect that one CPU will be
mainly doing computation while the other handles communications.
However, for certain kinds of applications, if the communication
requirements are small compared to the amount of computation, or if
there are separate computation and communication steps, then both
cores can be used for computation, leading to 5.6 Gflops peak
performance per node. We therefore quote the performance per node as
2.8/5.6 Gflops.
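
The per-node figures follow directly from the clock rate and the
number of multiply-adds issued per cycle. A minimal sketch of that
arithmetic, assuming the usual convention that one fused multiply-add
counts as 2 floating point operations (the function name is
illustrative only):

```python
# Peak floating point rate of a node, from the figures quoted in the text.
# Assumption: a fused multiply-add is counted as 2 floating point operations.

CLOCK_HZ = 700e6       # target CPU frequency
FMA_PER_CYCLE = 2      # "double" FPU: 2 fused multiply-adds per cycle
FLOPS_PER_FMA = 2      # one multiply plus one add


def peak_gflops(cores_computing: int) -> float:
    """Peak Gflops of a node when `cores_computing` cores do arithmetic."""
    return cores_computing * CLOCK_HZ * FMA_PER_CYCLE * FLOPS_PER_FMA / 1e9


print(peak_gflops(1))  # one core computes, the other handles communication
print(peak_gflops(2))  # both cores compute
```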

Figure 1 shows the steps of building the BlueGene/L supercomputer.
2 compute ASIC chips along with their associated local memory are put
onto a compute card. A node board will have 16 compute cards. A
cabinet includes 2 mid-planes. A total of 32 node boards are plugged
into both sides of the 2 mid-planes. Within a cabinet, the compute
nodes form a geometry of 8x8x16 (8x8x8 per mid-plane) with a peak
performance of 2.9/5.7 Tflops and a total of 256 GB memory. The
BlueGene/L system consists of 64 cabinets connected as a 32x32x64
torus. The total peak performance is 180/360 Tflops and the total
amount of memory is 16 TB. The system will occupy an area of about
250 m², and the total power is estimated at approximately 1 MW.
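
The cabinet and system totals can be checked by multiplying out the
packaging hierarchy. The sketch below uses only the counts given in
the text; the dictionary layout and names are illustrative. Note that
the raw products for the full system come to about 183.5/367.0
Tflops, which the text quotes as 180/360 Tflops.

```python
# Packaging hierarchy as described in the text; all counts are from the paper.
ASICS_PER_CARD = 2
CARDS_PER_BOARD = 16
BOARDS_PER_CABINET = 32
CABINETS = 64
MEM_PER_NODE_MB = 256
GFLOPS_PER_NODE = (2.8, 5.6)   # one core / both cores computing

nodes_per_cabinet = ASICS_PER_CARD * CARDS_PER_BOARD * BOARDS_PER_CABINET
total_nodes = nodes_per_cabinet * CABINETS


def summarize(nodes: int) -> dict:
    """Node count, peak Tflops (one core / two cores) and memory in GB."""
    return {
        "nodes": nodes,
        "peak_tflops": tuple(round(nodes * g / 1000, 1) for g in GFLOPS_PER_NODE),
        "memory_gb": nodes * MEM_PER_NODE_MB // 1024,
    }


cabinet = summarize(nodes_per_cabinet)
system = summarize(total_nodes)
print(cabinet)
print(system)
```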

In addition to the computing nodes in the system, there are also I/O
nodes. Each I/O node contains the same ASIC as a compute node, but
with 512 MB of memory. These additional I/O cards are plugged into
the same node boards as the compute cards. An I/O node connects to a
number of compute nodes through a custom high speed network, and to
the outside host and disk farm through Gigabit Ethernet. These I/O
nodes are used to offload the work required for disk I/O and host
communications from the compute nodes. We plan to install one I/O
node per 64 compute nodes. The maximum ratio is one I/O node for
every 8 compute nodes.
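
For a sense of scale, the two ratios translate into I/O node counts
for the full machine as follows. This is a small sketch using only
the figures above; the helper name is illustrative.

```python
COMPUTE_NODES = 32 * 32 * 64   # full 65,536-node system


def io_nodes(compute_nodes: int, compute_per_io: int) -> int:
    """I/O nodes needed, rounding up so every compute node is served."""
    return -(-compute_nodes // compute_per_io)


planned = io_nodes(COMPUTE_NODES, 64)   # planned ratio, 1 per 64
maximum = io_nodes(COMPUTE_NODES, 8)    # maximum ratio, 1 per 8
print(planned, maximum)
```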

Besides the 64 active cabinets, the large BG/L system also includes a
number of spare cabinets, Gigabit Ethernet switch racks, disk I/O
racks and a host computer. The Gigabit switches connect to the BG/L
I/O nodes, the host computer and the disk farm.

3. BG/L ARCHITECTURE 

The architecture of the BG/L node ASIC is shown in Figure 2. Each
ASIC integrates two PowerPC 440 cores, each with a PowerPC 440 FP2
core, 2 small L2 buffers, 4 MB of embedded DRAM configured as L3
cache, a DDR-SDRAM memory controller for connecting to external
memory, custom designed high speed torus and tree network logic, a
Gigabit Ethernet interface and a JTAG control interface.

3.1. PowerPC core 

The 440 is a standard 32-bit integer microprocessor core product from
IBM Microelectronics. The same core in a previous generation
technology is used in the QCDOC project. Each core has 32 general
purpose registers. It has an integrated 32 KB instruction L1 cache
and a 32 KB data L1 cache, which provide one cycle access from the
CPU. There are three 128 bit buses coming out of the core, one each
dedicated to data read, data write and instruction load. This core
supports all common PowerPC instructions as well as instructions
defined in the PowerPC Book E standard for embedded processors.

The 440 FP2 core is an enhanced "double" 64-bit floating-point unit.
It consists of a primary and a secondary unit, each of which is a
complete FPU with its own register set. The primary FPU supports
standard PowerPC floating point instructions. It can do a single
fused 64 bit multiply-add in one processor cycle. Through a SIMD-like
instruction extension, both FPU units can be used to execute 2 fused
64 bit multiply-adds per cycle. In addition, a separate floating
point load/store operation can be executed in parallel with the
multiply-adds. The load/store unit supports 128 bit "quad-word"
load/store from either the L1 cache or the 440 memory interface, to a
pair of registers, one each from the two FP units. This increased
load/store bandwidth to the FP registers matches the increased
floating point performance. It is also used to efficiently move data
to and from the custom high-speed network interfaces, as the 32 bit
integer unit does not have adequate bandwidth to its internal
registers to support high speed data movement. The floating point
instruction extension is also powerful enough to allow the exchange
of register contents between the two FP units while they are
executing multiply-adds, without extra instruction overhead.
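
To illustrate why paired multiply-adds with operand exchange between
the two units are useful, the sketch below models a complex
multiply-add, the basic operation in QCD kernels, as two paired
operations. The operand pairing chosen here is an illustrative
assumption and not a description of the actual FP2 instruction set;
Python arithmetic also rounds the multiply and the add separately,
whereas a fused multiply-add rounds once.

```python
from typing import Tuple

Pair = Tuple[float, float]   # (primary register, secondary register)


def paired_madd(a: Pair, b: Pair, c: Pair) -> Pair:
    """One modeled cycle: each FP unit computes a*b + c on its own half."""
    return (a[0] * b[0] + c[0], a[1] * b[1] + c[1])


def complex_madd(a: complex, b: complex, c: complex) -> complex:
    """Compute a*b + c with two paired multiply-adds (8 flops in total)."""
    # Step 1: both units use a.real:
    #   (c.re + a.re*b.re, c.im + a.re*b.im)
    t = paired_madd((a.real, a.real), (b.real, b.imag), (c.real, c.imag))
    # Step 2: both units use a.imag, with the halves of b exchanged and a
    # sign flip on the primary side:
    #   (t.re - a.im*b.im, t.im + a.im*b.re)
    r = paired_madd((-a.imag, a.imag), (b.imag, b.real), t)
    return complex(r[0], r[1])


print(complex_madd(1 + 2j, 3 + 4j, 5 + 6j))
```

In this model the 8 floating point operations of a complex
multiply-add take 2 cycles, which is consistent with the peak of 4
flops per cycle per core quoted earlier.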

3.2. Memory Subsystem 

Each of the 440 processor cores is directly connected to a small 2 KB
L2 cache, then to a shared L3 directory which controls 4 MB of
embedded DRAM as the L3 cache. The L3 controller connects directly to
a DDR-SDRAM controller for external memory. Both the 4 MB L3 cache
and the external DDR-SDRAM are ECC protected.

Because the 440 hardware does not support SMP protocols for its L1
cache, the L1 caches of the two processors are not coherent. We have
therefore implemented a lock box and a small fast multi-port SRAM
behind the two L2 caches to facilitate processor-to-processor
communications. The L2 and L3 caches are coherent for both
processors.

Figure 2. BlueGene/L node ASIC. Each ASIC integrates two 32 bit PowerPC 440 integer CPU cores, 2
"double" floating point FP2 cores, L2 cache buffers, 4 MB EDRAM as L3 cache, DDR-SDRAM controller
for external memory, high speed torus network logic, global tree network logic, Gigabit Ethernet and
JTAG control interface.

A data prefetch engine is built into each L2 cache to reduce the
latency for sequential data accesses from L3 and external memory. The
memory subsystem is being designed for low latency, high bandwidth
accesses to cache and memory. An L2 hit returns in 6 to 10 processor
cycles, an L3 hit in about 25 cycles, and an L3 miss (loading from
external DRAM) in about 75 cycles. Various memory bandwidth numbers
are listed in Table 1, where they are compared with those of QCDOC.
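
The quoted latencies can be combined into a rough expected access
time once hit fractions are assumed. The sketch below is a simple
weighted average; the hit fractions are hypothetical inputs, the L2
latency is taken as the midpoint of the quoted range, and the effect
of the prefetch engine is ignored.

```python
L2_HIT_CYCLES = 8      # assumption: midpoint of the quoted 6 to 10 cycles
L3_HIT_CYCLES = 25
DRAM_CYCLES = 75       # L3 miss, loading from external DRAM
CLOCK_HZ = 700e6


def mean_latency_cycles(l2_hit: float, l3_hit: float) -> float:
    """Expected cycles for an access that missed the L1 cache.

    l2_hit: fraction of such accesses served by the L2 cache
    l3_hit: fraction of L2 misses served by the L3 cache
    """
    if not (0.0 <= l2_hit <= 1.0 and 0.0 <= l3_hit <= 1.0):
        raise ValueError("hit fractions must lie in [0, 1]")
    l2_miss = 1.0 - l2_hit
    return (l2_hit * L2_HIT_CYCLES
            + l2_miss * l3_hit * L3_HIT_CYCLES
            + l2_miss * (1.0 - l3_hit) * DRAM_CYCLES)


def cycles_to_ns(cycles: float) -> float:
    """Convert processor cycles to nanoseconds at the target clock."""
    return cycles / CLOCK_HZ * 1e9


print(mean_latency_cycles(0.5, 0.5), cycles_to_ns(DRAM_CYCLES))
```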

The L2 caches are also directly connected to a 
set of FIFOs to allow for fast and high bandwidth 
access to both a 3-d torus network and a global 
tree network. 

3.3. 3-D Torus Network 

The 3-d torus is a high speed network used for general-purpose,
point-to-point message passing operations. Each ASIC has 6 torus
links built in. On a compute node, these 6 links are connected to its
6 nearest neighbors, in the +x, -x, +y, -y, +z and -z directions,
respectively. There is one link between a pair of nearest neighbors.
Each link is a bi-directional serial connection with a target speed
of 1.4 Gbit/s per direction.
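
The nearest-neighbor connectivity with wrap-around can be written
down compactly. A minimal sketch for the 32x32x64 geometry follows;
the coordinate convention and function name are illustrative
assumptions.

```python
DIMS = (32, 32, 64)   # torus geometry of the full system


def torus_neighbors(node, dims=DIMS):
    """Return the 6 nearest neighbors in the order +x, -x, +y, -y, +z, -z."""
    neighbors = []
    for axis in range(3):
        for step in (+1, -1):
            coord = list(node)
            # The modulo provides the wrap-around that closes each ring.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


print(torus_neighbors((0, 0, 0)))
```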

Figure 3 illustrates the torus logic. Within a
node ASIC, 2 sets of injection FIFOs and 2 sets 
of reception FIFOs are directly connected to the 
L2 cache. These FIFOs along with 6 input links 
and 6 output links are then connected through 
a global cross-bar switch. A packet injected on 
one node will route through the 3-d torus net- 
work in hardware without any software overhead 
until it reaches its destination, where it will then 
be pulled out by a CPU from a reception FIFO. 
A packet could vary in size from 32 bytes to 
256 bytes in 32-byte granularity. Even though in 



Figure 4. Adaptive routing of the torus network. 
A packet could take different routes to reach its 
destination. 



Virtual channels are used in the torus network to avoid deadlocks. A
token based flow control mechanism is used to improve throughput. The
torus switch is highly pipelined and implements virtual cut-through
routing. In addition, the torus network supports both adaptive and
deterministic minimal-path routing [9,10]. Figure 4 illustrates
adaptive routing. Packets can take different paths to reach their
destination as long as the paths are minimal, i.e., every hop a
packet makes reduces its distance to the destination. The actual path
a packet takes is determined dynamically on each node depending on
local traffic. Together, these features optimize both the throughput
and the latency of the torus network.
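
The minimal-path condition can be made concrete with a small software
model. In the sketch below, a hop is "productive" if it reduces the
distance to the destination, and the adaptive choice picks the
least-loaded productive link. The `congestion` table is a
hypothetical stand-in for local traffic information; the hardware's
actual selection logic is not described here.

```python
import random

DIMS = (32, 32, 64)


def signed_offset(src: int, dst: int, size: int) -> int:
    """Shortest signed displacement from src to dst on a ring of `size`."""
    d = (dst - src) % size
    return d if d <= size // 2 else d - size


def hop_distance(a, b, dims=DIMS) -> int:
    """Minimal number of hops between two nodes on the torus."""
    return sum(abs(signed_offset(x, y, n)) for x, y, n in zip(a, b, dims))


def productive_moves(node, dest, dims=DIMS):
    """All single hops (axis, step) that reduce the distance to dest."""
    moves = []
    for axis in range(3):
        off = signed_offset(node[axis], dest[axis], dims[axis])
        if off != 0:
            moves.append((axis, 1 if off > 0 else -1))
    return moves


def route(src, dest, congestion=None, dims=DIMS, rng=random):
    """Walk one minimal path from src to dest, choosing hops adaptively."""
    congestion = congestion or {}
    path, node = [src], src
    while node != dest:
        moves = productive_moves(node, dest, dims)
        load = {m: congestion.get((node, m), 0) for m in moves}
        best = min(load.values())
        # Least-loaded productive link; ties are broken at random.
        axis, step = rng.choice([m for m in moves if load[m] == best])
        nxt = list(node)
        nxt[axis] = (nxt[axis] + step) % dims[axis]
        node = tuple(nxt)
        path.append(node)
    return path


path = route((0, 0, 0), (30, 2, 60))
print(len(path) - 1, "hops")
```

Whatever choices are made along the way, the path length equals the
minimal hop distance, since every hop taken is productive.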

In addition to point-to-point message passing, we have also
implemented multicast operations in the torus hardware to allow one
node to broadcast a message to a "class" of nodes. This feature has
proven very useful for a number of applications.

All data and token flow-control packets are protected by CRC. If an
error occurs on a link, the corrupted packet is automatically deleted
from the network on reception, and a good copy of the packet is
retransmitted over the same link. The error detection and
retransmission protocol is handled automatically by the torus
hardware.
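
The detect, drop and retransmit behavior can be mimicked in software.
The sketch below uses CRC-32 from Python's zlib module purely as a
stand-in; the CRC width and polynomial used by the torus hardware are
not specified here, and the function names and error-injection
mechanism are illustrative assumptions.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer (stand-in for the hardware CRC)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes):
    """Return the payload if the trailer matches its CRC, else None."""
    payload, trailer = framed[:-4], framed[-4:]
    ok = zlib.crc32(payload).to_bytes(4, "big") == trailer
    return payload if ok else None


def send_over_link(payload: bytes, corrupt_attempts=(), max_tries: int = 8):
    """Send until the receiver accepts the packet.

    Attempts whose index is listed in `corrupt_attempts` get one bit
    flipped on the wire. Returns (received payload, transmissions used).
    """
    for attempt in range(max_tries):
        wire = bytearray(frame(payload))
        if attempt in corrupt_attempts:
            wire[0] ^= 0x01              # inject a single-bit error
        received = check(bytes(wire))
        if received is not None:         # good packet: accept it
            return received, attempt + 1
        # Bad packet: it is dropped here and the loop retransmits.
    raise RuntimeError("link failed after max_tries transmissions")


packet = bytes(range(32))                # smallest packet size, 32 bytes
print(send_over_link(packet, corrupt_attempts={0, 1})[1], "transmissions")
```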

3.4. Global Tree Network 

On a large scale parallel machine like BlueGene/L, one constraint
that affects the scalability of a large class of applications is the
time it takes for a global reduction operation, for example, a global
sum.

To reduce the latency of global reduction operations, the BG/L
supercomputer implements a global tree network. This is a
binary-tree-like network shown in Figure 5. The global tree network
supports global broadcast from a single node to all nodes, and global
reduction operations including global integer maximum, global integer
sum and certain binary operations like global AND, OR and XOR. The
tree network logic is also integrated into the node ASIC.
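
The benefit of a tree for reductions is that the number of combining
stages grows only logarithmically with the node count. The sketch
below models this with a plain pairwise binary reduction; it is a
simplification that ignores the actual tree wiring, where the
combining is done in the node ASICs themselves.

```python
import operator


def tree_reduce(values, op):
    """Combine values pairwise, level by level.

    Returns (result, number of combining levels).
    """
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    depth = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # odd element passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth


def global_reduce(values, op):
    """Reduce, then broadcast the single result back to every contributor."""
    values = list(values)
    result, depth = tree_reduce(values, op)
    return [result] * len(values), depth


# The operations named in the text: integer max, integer sum, AND, OR, XOR.
total, levels = tree_reduce(range(65536), operator.add)
print(total, levels)
```

For 65,536 contributions the model needs 16 combining levels, compared
with 65,535 sequential additions on a single node.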

During a global reduction, each node contributes a message. The
messages from all nodes are combined into a single message, which is
then broadcast to all contributing nodes. A single
round-trip latency on the global tree network is


Figure 6. Illustration of partitioning of BG/L. 
A cabinet of nodes can either be connected in a 
torus loop, or be jumped over where the cabinet 
will form its own torus in a separate physical par- 
tition. 



Figure 6 illustrates the basic idea of BG/L partitioning. Within a
BG/L cabinet, there are 2 mid-planes, each with 512 compute nodes
attached, forming an 8x8x8 3-d lattice. On the edge of each
mid-plane, there is a set of re-drive chips. They capture the high
speed torus and tree network signals coming from node ASICs over a
certain length of mid-plane board trace, then re-drive them over the
cables connecting different cabinets. This improves the high speed
signal quality. These re-drive chips can be programmed by the host
either to include the nodes of the current cabinet in the torus loop
for a large partition, or to skip over them, thereby creating a
separate physical partition.

This scheme allows each partition to be electri- 
cally isolated. Each partition has its own torus 
and tree networks that do not communicate with 
other partitions. With a few more spare cabinets 
Figure 7. BlueGene/L system expected to be
built at the LLNL.



4. SOFTWARE

Scalable system software that supports efficient execution of
parallel applications is an integral part of the BlueGene/L
architecture. Our plan is to have a lightweight high-performance
kernel running on compute nodes, and Linux running on I/O nodes. The
lightweight compute kernel approach was motivated by the Puma and
Cougar kernels at Sandia National Laboratory and the University of
New Mexico.

The BG/L compute kernel is a single user OS that supports execution
of a single dual-threaded (one thread for each of the two processors
within a node) user process. It provides a single, static virtual
address space to the one running compute process. There is no need
for context switching. Because the PowerPC 440 processor core
supports large pages, demand paging is not necessary. The user
process receives full use of the node's resources, yet the OS is
still protected from the user application through a virtual memory
system.

I/O nodes will support multiple processes. They will execute only
system software, not user applications. They provide I/O support to
compute nodes, i.e., file operations to the disk farm, control and
monitoring support for the host, etc.

In terms of development tools and environment, IBM's XL-C, XL-C++ and
XL-FORTRAN compilers are expected to be ported to support both the
PPC 440 integer unit and the FP2 floating point unit. The GNU-C
compiler already supports the common set of PowerPC instructions,
although it does not support the FP2 instruction extensions, which
the IBM compilers are expected to support. A set of highly optimized
math libraries is also expected to be provided to facilitate high
performance application development. As for the parallel environment,
we anticipate providing MPI support as well as a user level, low
latency system programming interface to facilitate the use of the
high speed torus and tree networks.



5. APPLICATIONS

The BG/L team has been doing extensive studies of a wide variety of
applications [11,12]. The results of these studies are very
encouraging. They show that a large class of applications scales well
on the BG/L architecture, even to the level of 65,536 nodes.

In this section, for the interest of the lattice community, we
compare the architecture of BG/L to QCDOC, as applied to QCD type
applications. Reference [13] shows that for Wilson fermions, on a 2^4
lattice per node, QCDOC can sustain about 50% of the peak
performance.

Table 1
Comparisons between QCDOC and BlueGene/L.

                        QCDOC         BlueGene/L
CMOS Technology         0.18 μm       0.13 μm
CPU                     500 MHz       700 MHz x 2
EDRAM size              4 MB          4 MB
Peak flops/node         1 Gflops      5.6 Gflops
Network Topology        6-d torus     3-d torus
Bandwidth:
  L1 cache to FPU       4 GB/s        22.4 GB/s
  CPU interface peak    8 GB/s        11.2 GB/s
  CPU read sustained    3 GB/s        8.4 GB/s
  EDRAM interface       16 GB/s       22.4 GB/s
  External DRAM         2.6 GB/s      5.6 GB/s
  Torus/link            0.5 Gbit/s    1.4 Gbit/s


Table 1 shows a comparison of various performance and memory
bandwidth numbers between QCDOC and BG/L. For Lattice QCD
applications, because most of the communications are between nearest
neighbors, we expect that both processors on a BG/L node will be
available for computation. From QCDOC to BG/L on a per node basis,
the sustained memory read bandwidth from EDRAM improves by a factor
of 2.8, the external memory bandwidth by a factor of 2.15 and the
torus network bandwidth by a factor of 2.8. While the peak floating
point performance is increased from 1 Gflops to 5.6 Gflops, we expect
that the overall performance improvement per node will be determined
mainly by the memory bandwidth, and will therefore be about a factor
of 2.15 to 2.8. Given the high efficiency of QCDOC, we expect BG/L to
also perform well for QCD applications.
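
The improvement factors quoted above can be reproduced from the
Table 1 entries. The short sketch below also computes the sustained
read bandwidth available per peak flop, which is one way (our
illustration, not a figure from the table) to see why memory
bandwidth rather than peak flops is expected to set the per-node
gain.

```python
# (QCDOC, BG/L) pairs taken from Table 1.
TABLE_1 = {
    "CPU read sustained (GB/s)": (3.0, 8.4),
    "External DRAM (GB/s)": (2.6, 5.6),
    "Torus per link (Gbit/s)": (0.5, 1.4),
    "Peak flops per node (Gflops)": (1.0, 5.6),
}

ratios = {name: bgl / qcdoc for name, (qcdoc, bgl) in TABLE_1.items()}
for name, ratio in ratios.items():
    print(f"{name}: x{ratio:.2f}")

# Sustained read bandwidth per peak flop, in bytes per flop.
read = TABLE_1["CPU read sustained (GB/s)"]
peak = TABLE_1["Peak flops per node (Gflops)"]
bytes_per_flop = {"QCDOC": read[0] / peak[0], "BG/L": read[1] / peak[1]}
print(bytes_per_flop)
```

In this model BG/L has about half as many sustained bytes per peak
flop as QCDOC, so a bandwidth-bound kernel would be expected to gain
roughly the bandwidth factor rather than the factor of 5.6 in peak
flops.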

6. CONCLUSION 

System-on-a-chip technology has opened new possibilities for building
large scale supercomputers. By optimizing the overall system, great
reductions in cost/performance, power consumption/performance and
machine size/performance can be achieved.



BlueGene/L is a first step in IBM's commitment to petaflops scale
computing by exploring a new architecture for building massively
parallel machines. As more applications are ported to and optimized
for this kind of architecture, there will be great scientific
benefits and rewards.